An Image Classification Method Based on Adaptive Attention Mechanism and Feature Extraction Network

The convolutional neural network (CNN) not only has high fault tolerance but also high computing capacity. The image classification performance of a CNN is closely related to its network depth: the deeper the network, the stronger its fitting ability. However, increasing the depth beyond a certain point does not improve accuracy but instead produces higher training errors, which degrades the image classification performance of the CNN. To address this problem, this paper proposes AA-ResNet, a feature extraction network with an adaptive attention mechanism, in which residual modules with adaptive attention are embedded for image classification. The model consists of a pattern-guided feature extraction network, a pretrained generator, and a complementary network. The pattern-guided feature extraction network extracts features at different levels to describe different aspects of an image, so the model makes effective use of both global and local image information and its feature representation ability is enhanced. The whole model is trained with a specially designed classification loss formulated as a multitask problem, which helps reduce overfitting and makes the model focus on easily confused categories. The experimental results show that the proposed method performs well in image classification on the relatively simple Cifar-10 dataset, the moderately difficult Caltech-101 dataset, and the Caltech-256 dataset, whose objects differ greatly in size and location, with fast fitting and high accuracy.


Introduction
Image classification is an image processing technology that extracts effective features from the original image data and then distinguishes the different categories of objects in the image. Image classification is now one of the most important research hotspots in the field of computer vision. To realize image classification with traditional methods, two stages are required. First, the unstructured raw image data must be converted into structured data that represents features. Then, the structured feature data are input into a trainable classifier, and finally the classification results are obtained [1,2]. However, traditional image classification methods have the shortcoming that classification accuracy depends to a large extent on the effectiveness of feature extraction [3]. Sometimes, nondiscriminative features in the image can have a strongly negative impact on the classification results. In contrast, CNN-based methods do not need complicated feature extraction and can integrate the feature analysis of image data directly into the CNN model. By merely adjusting the weights and offsets of the network model, an effective distinction of image features can be achieved quickly [4,5].
In traditional methods, image classification often works in stages: the first step is feature learning, then feature coding, with spatial constraints as well; of course, the classifier is indispensable, and the whole process is complex [6,7]. This traditional pipeline may be effective for some simple image classification tasks, but when it is applied to real complex scenes, the results are often not optimistic. Therefore, researchers began trying convolutional neural networks for the image classification task, extracting features from images without manual operation and analyzing the sample data so that specific targets can be assigned to a label among known mixed categories [8]. However, when a CNN model is used for image classification, its performance is strongly related to the depth of the network model: the deeper the network, the stronger the fitting ability of the CNN. Yet, as the depth of the network model keeps increasing, the classification accuracy of the model does not improve; the gradient vanishes, and the network model produces higher errors [9,10]. In view of the above problems, the ResNet model can, to a certain extent, solve the performance degradation of very deep models [11]. Recent research shows that the performance of convolutional neural networks can be improved using cross-layer connections. The typical residual network (ResNet) uses this idea and achieves very good image recognition results through identity mapping. However, in the residual module, the layout of the cross-layer connections does not reach an optimal setting, resulting in information redundancy and wasted layers [12,13].
The image classification performance of a CNN is closely related to its network depth: the deeper the network, the stronger the fitting ability of the CNN. However, a further increase in the depth of the CNN will not improve the accuracy of the network but will produce higher training errors, which reduces the image classification performance of the CNN. To solve the above problems, this paper proposes AA-ResNet, a feature extraction network with an adaptive attention mechanism, in which residual modules with adaptive attention are embedded for image classification.

Related Work
In recent years, with the advent of the big data era and the development of computer hardware, people have begun a more comprehensive and in-depth study of deep learning. A large body of theory and experiments proves that the convolutional neural network in deep learning plays an irreplaceable role in the field of image classification. It can obtain image features through training on a large number of samples, and the designed classifier is closely related to the extracted features [14]. Compared with traditional classification algorithms, convolutional neural networks can automatically learn the features of image data, and people do not need to spend a lot of effort manually extracting image features. This has a significant effect on image classification tasks with large sample sizes and small differences between categories [15].
An extremely deep convolutional neural network will not only cause the gradient to vanish but also increase the risk of network overfitting, which affects the image classification accuracy of the network model. The initial research approach was mainly to solve the vanishing-gradient problem by carefully initializing the network model and conducting layer-wise training. At present, deep convolutional neural networks mostly use the ReLU activation function to alleviate the vanishing gradient; compared with the sigmoid function, ReLU is more effective in this regard [16]. Direct supervision can be used to train a deep CNN model, but during training, once some inputs of the ReLU activation enter the hard saturation region, the corresponding weights cannot be updated quickly [17]. The resulting neuron death phenomenon makes the CNN difficult to converge. Therefore, many new activation functions have been proposed. For example, the PReLU function introduces additional parameters to improve its performance. The ELU (exponential linear unit) function combines the sigmoid and ReLU functions; because of its left-side saturation, the vanishing-gradient problem of the network model can be greatly alleviated [18,19]. The PELU (parametric exponential linear unit) function adds parameters to the ELU function and controls the behavior of the activation by updating those parameters during training, which better controls bias drift and gradient vanishing [20,21]. When the depth of the CNN is increased further, the accuracy of the model does not improve; on the contrary, the gradient vanishes, which results in higher errors. For these problems, the ResNet model is better than an ordinary convolutional neural network in terms of convergence and classification performance [22]. The main innovation of the ResNet model is the residual structure introduced into the network, which can quickly transfer the results of the previous layer to the next layer. When the network model is further deepened, the error does not continue to increase, which allows the ResNet model to have more layers and higher accuracy [23,24]. As the depth of the network increases, its accuracy gradually pulls ahead of that of an ordinary network model. This clearly solves, to a certain extent, the degradation of image classification performance of very deep network models [25].
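To make the differences between these activation functions concrete, the following is a minimal sketch of ReLU, ELU, and PELU in Python/NumPy; the parameter names alpha, a, and b and their default values follow common conventions and are illustrative assumptions, not values taken from the cited works.

```python
# Illustrative definitions of the activation functions discussed above.
import numpy as np

def relu(x):
    # ReLU: zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # ELU: identity for positive inputs, saturates toward -alpha on the left,
    # which helps keep gradients alive for negative activations.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def pelu(x, a=1.0, b=1.0):
    # PELU: parametric form of ELU in which a and b are learned during
    # training, giving the network control over slope and saturation.
    return np.where(x > 0, (a / b) * x, a * (np.exp(x / b) - 1.0))
```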

Detailed Steps of Residual Block.
The detailed steps of the residual block based on the attention mechanism can be summarized as follows:
Step 1. For an intermediate feature map X_0 ∈ R^(H×W×C) in a layer of the neural network, in order to reduce the number of parameters and the amount of calculation, it is sent to a convolution layer with a 1 × 1 kernel to obtain the feature map X_1 ∈ R^(H×W×C_1), where H × W represents the spatial dimension of the corresponding feature map and C and C_1 represent the channel dimensions of the corresponding feature maps.
Step 2. Group convolution is used to perform a channel-based grouping operation on the feature map X_1 to obtain several sub-feature maps of the same dimension, X_1 = {x_1, ..., x_group}, x_i ∈ R^(H×W×C_1/group), where group represents the number of sub-feature maps, i = 1, ..., group, and x_i represents the i-th sub-feature map.
Step 3. The spatial group-enhance attention transformation applied to each sub-feature map x_i consists of the following operations:
(1) For each sub-feature map x_i ∈ R^(m×c), a spatial global average pooling operation F_gp(·) is performed to obtain the global semantic vector g ∈ R^(1×c) of the sub-feature map x_i, where m = H × W represents the spatial dimension of the sub-feature map and c = C_1/group represents its channel dimension.
(2) The global semantic vector g is multiplied pointwise with each position of the sub-feature map x_i to obtain the importance coefficient c_i ∈ R^(m×1) corresponding to each sub-feature map.
(3) Each importance coefficient c_i is normalized over the spatial dimension to obtain c_i′.
(4) Each standardized importance coefficient c_i′ is scaled and translated to obtain a_i.
(5) The newly generated importance coefficient a_i is passed through a sigmoid function σ(·) and combined with the corresponding sub-feature map x_i to generate a spatially enhanced feature map x_i′.
(6) Combining the spatially enhanced feature maps from step (5) yields the feature map X_1′ = {x_1′, ..., x_group′}, x_i′ ∈ R^(H×W×C_1/group), where group represents the number of spatially enhanced sub-feature maps, i = 1, ..., group, and x_i′ represents the i-th enhanced sub-feature map.
(7) X_1′ is fed into a convolution layer with a 1 × 1 kernel for a dimension-raising operation, obtaining the feature map X_0′ ∈ R^(H×W×C), which has the same dimensions as X_0, where H × W represents the spatial dimension of X_0′ and C represents its channel dimension.
(8) The intermediate feature map X_0 is combined with the feature map X_0′ derived in step (7) to yield the output feature map of this spatially grouped attention module, which again lies in R^(H×W×C).
Through the above steps, an intermediate feature map passing through the residual blocks in the middle layers of the network obtains an attention-enhanced feature map. The feature maps are enhanced to different degrees by the stacked residual blocks. Similarly, with the aid of group-wise convolution, the width of the network is increased, further enriching the semantic information of the feature maps. In this way, the classification accuracy of the network model can be greatly improved [26,27].
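As an illustration only, the following PyTorch sketch implements the steps above under the assumption that the reduced channel dimension C_1 is divisible by the number of groups; the class and parameter names are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn as nn

class SpatialGroupEnhanceBlock(nn.Module):
    def __init__(self, channels, reduced_channels, groups=8):
        super().__init__()
        self.groups = groups
        # Step 1: 1x1 convolution reduces the channel dimension C -> C1.
        self.reduce = nn.Conv2d(channels, reduced_channels, kernel_size=1)
        # Step 3(4): per-group scale and shift for the normalized coefficients.
        self.weight = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.ones(1, groups, 1, 1))
        # Step 3(7): 1x1 convolution restores the channel dimension C1 -> C.
        self.expand = nn.Conv2d(reduced_channels, channels, kernel_size=1)

    def forward(self, x0):
        b, c, h, w = x0.shape
        x1 = self.reduce(x0)                                   # Step 1
        # Step 2: split the channels into `groups` sub-feature maps.
        x = x1.view(b * self.groups, -1, h, w)
        # Step 3(1): spatial global average pooling -> global semantic vector g.
        g = x.mean(dim=(2, 3), keepdim=True)
        # Step 3(2): dot product of g with each position -> importance coefficients.
        coef = (x * g).sum(dim=1, keepdim=True)                # (b*groups, 1, h, w)
        # Step 3(3): normalize the coefficients over the spatial dimension.
        coef = coef.view(b * self.groups, -1)
        coef = (coef - coef.mean(dim=1, keepdim=True)) / (coef.std(dim=1, keepdim=True) + 1e-5)
        coef = coef.view(b, self.groups, h, w)
        # Step 3(4): scale and translate; Step 3(5): sigmoid gating of each sub-map.
        coef = coef * self.weight + self.bias
        coef = coef.view(b * self.groups, 1, h, w)
        x = x * torch.sigmoid(coef)
        # Steps 3(6)-(7): regroup and raise the channel dimension back to C.
        x = x.view(b, -1, h, w)
        x0_new = self.expand(x)
        # Step 3(8): combine with the original intermediate feature map.
        return x0 + x0_new
```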
Therefore, this paper chose the ResNet101 residual network model as the benchmark model, and in the next step, attention mechanisms will be incorporated into the ResNet101 model to further improve the classification accuracy.

Residual Module Improvement.
The residual network can deepen the network without reducing the efficiency of the model, and its schematic is shown in Figure 1.
Consider a neural network with input x and output α^[l]. When two new layers are added at the end, the output becomes α^[l+2]; these two newly added layers form a residual block. The network uses the ReLU activation function, so none of the activation values is less than 0. The value of α^[l+2] can be represented by (1):

α^[l+2] = f(z^[l+2] + α^[l]), (1)

where f is the activation function, z^[l+2] is the unactivated output of layer l + 2, and α^[l] is the identity mapping carried from layer l to layer l + 2. For (1) to be computable, α^[l] must have the same dimension as z^[l+2]. Expanding (1) gives (2):

α^[l+2] = f(W^[l+2] α^[l+1] + b^[l+2] + α^[l]). (2)

If W^[l+2] = 0 and b^[l+2] = 0, then, since the entire network uses ReLU activation, α^[l+2] = α^[l]. That is, the skip connection makes it easy for the network to learn α^[l+2] = α^[l]; even if two layers are added to the network, its learning efficiency will not decrease.
It is easy for the network to learn the identity function: even with two more layers, α^[l] can be passed to α^[l+2] through identity mapping, so the network performance will not be affected and the network efficiency can be improved. In practice, W^[l+2] will not be 0, so new information can be learned through the residual block, and the network does better than merely learning the identity function. An ordinary network without residual blocks finds it difficult to learn the identity connection and will not perform better than before. Therefore, adding residual blocks helps improve network efficiency. The parameters of the network model can also be greatly reduced: the principle is to first reduce and then increase the dimension using 1 × 1 convolution kernels. In order to better extract image features, the attention mechanism module is integrated into the original residual unit to form a new residual structure; the specific integration position is shown in Figure 2.
Although the attention mechanism adds some computational overhead, it is very beneficial to improving the classification accuracy of the network model. Figure 2 shows a feature map with an input size of 64 × 56 × 56 and the process by which the attention weight of each channel is computed in the residual unit. The feature map with input x and size 64 × 56 × 56 first passes through three two-dimensional convolutions, which serve to reduce the number of parameters. The first two convolution layers are activated by the ReLU function. The left side of the figure shows the relevant parameters: stride, convolution kernel size, and padding. After three convolutions and two activation operations, the output feature map size is 256 × 56 × 56.
Secondly, when entering the attention mechanism module, the feature map output in the previous step first undergoes adaptive global average pooling to obtain an output feature map of size 256 × 1 × 1. After removing one dimension and exchanging the positions of the remaining two dimensions, one-dimensional convolution is performed; in this step, the convolution kernel size is 3, the stride and padding are 1, and the feature map size is 1 × 256. The dimensions of the feature map are exchanged and expanded by one dimension again, and then sigmoid activation is performed to obtain 256 weight values representing the weight of each channel. The 256 weight values are multiplied by the initial feature information to obtain a new output feature map of size 256 × 56 × 56. This is the process of inputting x and obtaining F(x). Then, identity mapping is performed. To ensure that the input feature map x and the output feature map F(x) have the same dimension and can be added, the dimension of x must first be increased, as shown on the right side of Figure 2, so that the channel dimension becomes 256. The output of the residual unit, H(x) = F(x) + x, is then obtained.
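The following PyTorch sketch mirrors the flow described above (a 64 × 56 × 56 input, three convolutions, a one-dimensional convolution over the pooled channel descriptor, sigmoid gating, and a 1 × 1 shortcut raising x to 256 channels); the layer names and the intermediate channel width are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentionResidualUnit(nn.Module):
    def __init__(self, in_channels=64, mid_channels=64, out_channels=256):
        super().__init__()
        # Bottleneck: three 2-D convolutions, the first two followed by ReLU.
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        # Channel attention: 1-D convolution (kernel 3, stride 1, padding 1)
        # over the pooled channel descriptor, followed by a sigmoid.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1, bias=False)
        # Shortcut: 1x1 convolution raising x to out_channels so F(x) + x is valid.
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        # F(x): bottleneck convolutions.
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)                                # (N, 256, 56, 56)
        # Channel attention weights.
        w = self.pool(out)                                   # (N, 256, 1, 1)
        w = w.squeeze(-1).transpose(1, 2)                    # (N, 1, 256)
        w = self.conv1d(w)                                   # (N, 1, 256)
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))   # (N, 256, 1, 1)
        out = out * w                                        # reweight each channel
        # H(x) = F(x) + x, with the shortcut raised to the same dimension.
        return out + self.shortcut(x)

# Example: a 64 x 56 x 56 input yields a 256 x 56 x 56 output.
# y = AttentionResidualUnit()(torch.randn(1, 64, 56, 56))  # y.shape == (1, 256, 56, 56)
```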

The Setting of Model Parameters.
When training the AA-ResNet model in this paper, the hyperparameters that need to be set mainly include the batch size, the learning rate, the choice of the number of classes, and the weight decay rate. The batch size determines the descent direction of the AA-ResNet model. When the dataset is large enough, the batch size should be reduced appropriately to greatly reduce the amount of calculation; if the amount of data is small and there is noisy data, the batch size should be set to a larger value to reduce the interference of noisy data. When the batch size reaches a certain value, the AA-ResNet model is optimal in terms of training time and convergence accuracy. The magnitude of each weight update is closely related to the learning rate, so it is very beneficial to set the learning rate in an appropriate range so that the AA-ResNet model descends along the gradient toward the optimal value. If the learning rate is set too large, the weights of the AA-ResNet model will overshoot the optimal value and then swing back and forth around it with a small remaining error; if the learning rate is set too small, optimizing the AA-ResNet model will require a lot of time, and the model may not even converge. The initial learning rate of the AA-ResNet model in this paper is set to 0.1 and, as the number of iterations increases, it is gradually adjusted down to 1/10000, which improves the accuracy of the AA-ResNet model while obtaining a faster training speed. Overfitting often occurs during training, and the larger the weights of the AA-ResNet model, the greater the risk of overfitting. To reduce this risk, a penalty term is added to the error function. The weight decay rate is the main parameter of the L2 regularization; its main function is to adjust the influence of the complexity of the AA-ResNet model on the loss function. L2 regularization yields parameters with small values and thus reduces the risk of model overfitting.
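As a rough sketch of this hyperparameter setting, the snippet below builds an SGD optimizer with an initial learning rate of 0.1, L2 regularization via the weight decay term, and a stepwise schedule that lowers the learning rate to 1/10000 as training proceeds; the choice of SGD, the momentum value, and the milestone epochs are assumptions, not settings reported in this paper.

```python
import torch

def build_optimizer(model, base_lr=0.1, weight_decay=1e-4):
    # weight_decay adds the L2 penalty term to the error function.
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=weight_decay)
    # Divide the learning rate by 10 at each milestone, so that after the last
    # milestone it has fallen from 0.1 to 0.1 * 0.001 = 1/10000.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[15, 30, 45], gamma=0.1)
    return optimizer, scheduler
```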

Model Evaluation Index.
In this paper, the evaluation indicators of the model include training time, train acc, test acc, the loss value, the confusion matrix, and line graphs of the changes in accuracy and loss value. Here, train acc refers to the proportion of correctly predicted samples in the training set to the total number of samples in the training set (in semantic segmentation, it refers to the proportion of correctly predicted pixels in the training set to the total number of pixels). Test acc refers to the proportion of correctly predicted samples in the test set to the total number of samples in the test set (in segmentation, the proportion of correctly predicted pixels in the test set to the total number of pixels). The loss value reflects the degree to which the predicted value of the model differs from the real value and is calculated with the cross-entropy loss function. For sample i, a vector y^(i) ∈ R^q is constructed such that the element corresponding to the category of sample i (a discrete value) is 1 and the rest are 0, representing the real label; ŷ^(i) denotes the probability distribution predicted by the model, and Θ denotes the model parameters. Training time refers to the time required to train the model.
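The quantities above can be written compactly as follows; this NumPy sketch assumes y_onehot stores the one-hot real labels y^(i) and y_hat stores the predicted probability distributions, with function and array names chosen for illustration.

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    # Proportion of correctly predicted samples (train acc / test acc).
    return np.mean(pred_labels == true_labels)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # Mean over samples of -sum_j y_j^(i) * log(y_hat_j^(i)).
    return -np.mean(np.sum(y_onehot * np.log(y_hat + eps), axis=1))
```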

Cifar-10 Data Set Experimental Results.
The Cifar-10 data set is a relatively easy data set for image classification. One reason is that the number of images available for training in the Cifar-10 data set is large, 60000 in total, of which 50000 were used in the experiment. Second, there are only ten categories, so the number of classes is small. According to the theoretical research and inference on existing algorithms, higher accuracy can be obtained after training the convolutional neural network models, and this is supported by the data in Table 1. It can be seen from Table 1 that the accuracy of the five models on the Cifar-10 data set is high and the differences are small. The accuracy on the training set exceeds 95%, and the accuracy on the test set is about 90%. From the data, we can find that the proposed model fits this dataset best: its training set and test set accuracies are the highest among all models, 99.96% and 92.43%, respectively, and its loss value is the lowest, only 0.00129.
It can be observed from Figure 3 that the loss values of the five models gradually decrease from the interval of 1.5 to 2 down to about 0. Among them, the cyan curve of the AA-ResNet model lies at the bottom overall; its loss value approaches 0 at about 8 epochs, and its fitting speed is fast.
The accuracy line graph in Figure 4 reflects the changes in the training set accuracy (train acc) and the test set accuracy (test acc) as the number of epochs increases. It can be found that the accuracy starting points of the training and test sets of the AlexNet, VggNet, and GoogLeNet models are fairly consistent, between 20% and 40%, while the starting point of AA-ResNet is relatively high, between 50% and 60%, so its initial fit is relatively good. At the same time, it can also be seen that the accuracy curve of AA-ResNet has a steep initial slope, which indicates that the accuracy improves greatly within a small number of epochs; this is another sign of excellent performance. It was also found that once the accuracy of the training set stabilizes, the accuracy of the test set no longer changes.
In terms of both the data parameters and the overall curves, the classification results of the five models in the Cifar-10 experiment are excellent. This is not only because the classification difficulty of the Cifar-10 data set is low but also reflects the superiority of convolutional neural networks in the field of image classification. Nevertheless, there are small differences between the models: AA-ResNet has the best performance on the Cifar-10 data set, with high accuracy and a fast fitting speed, while the comprehensive performance of the other models is relatively consistent.

Caltech-101 Data Set Experimental Results.
The classification difficulty of the Caltech-101 data set is significantly higher than that of the Cifar-10 data set. Not only has the number of categories increased from 10 to 101, roughly a tenfold increase, but the number of images available for training has also been reduced from 60000 to 9000, which is a great challenge for the convolutional neural network models. Generally speaking, the accuracy of image classification will be significantly reduced, and the differences between the models will become more obvious. Table 2 supports this conjecture.
The data in Table 2 are the results of training on the Caltech-101 training set for 50 epochs. Owing to the increased difficulty of image classification in the Caltech-101 dataset, the classification accuracy on the test set is significantly lower than on the Cifar-10 dataset. It can be found that the accuracy of the five models on the Caltech-101 test set is about 60%-70%, and the differences are also more pronounced. At the same time, it can also be found that the convolutional neural networks fit the training set of this dataset well, with training set accuracies above 98%, but the test set accuracies differ. AA-ResNet reaches about 70%, the highest of the models compared. Moreover, due to the reduction in the number of images, the training time of a single epoch decreased significantly, from 253.71 seconds to 55.82 seconds: the number of training images decreased roughly 3-fold, and the training time decreased roughly 4-fold.
In Figure 5, the starting position of the loss value of the five models is near 4, which is higher than in the previous Cifar-10 experiment. Among them, the loss value of AA-ResNet reaches about 0 at 14 epochs, far ahead of the other models. This also shows that the performance of the convolutional neural network models on the Caltech-101 data set is not as good as on the Cifar-10 data set, whose confusion matrix has few bright blocks off the diagonal and no dark blocks on the diagonal. At the same time, it can also be found from the confusion matrices that the classification effect of each model on the small classes is inconsistent, and the classes on which each model classifies well or poorly differ, which also shows the differences between the models.
As shown in Figure 6, the Caltech-101 data set takes 50 epochs for all models to fit well, with the training set accuracy (train acc) reaching more than 95%; the corresponding Cifar-10 experiment took only 30 epochs, which also confirms that the Caltech-101 data set is more complex and harder to fit. It can be seen from the figure that AA-ResNet fits quickly, and its training set accuracy is close to 100% at around 20 epochs. However, the GoogLeNet and DenseNet models fluctuate greatly; GoogLeNet even shows a large V-shaped fluctuation, which is generally caused by a poor fitting effect. It can also be found that when the training set accuracy (train acc) and the test set accuracy (test acc) curves stabilize, the two lines remain separated by a large distance, which also indicates that the models have not achieved a particularly good effect.
On the whole, it can also be found that Caltech-101 is more complex: the fitting effect of the models is lower than in the previous Cifar-10 experiment, and differences between the models appear. AA-ResNet performs well overall, with small fluctuations and a fast fitting speed.

Caltech-256 Data Set Experimental Results.
The Caltech-256 data set is much more complex than the Cifar-10 and Caltech-101 data sets used in the above experiments; it is among the most difficult image classification data sets. The Caltech-256 dataset is composed of 256 categories, a large number of classes, and the images in the dataset differ greatly in object size and position. All of this increases the difficulty of classification, which allows us to explore the specific performance of each network model in this complex situation and the differences between them. Table 3 shows the experimental results on the Caltech-256 data set.
The data in Table 3 are the results of training on the Caltech-256 training set for 50 epochs. It can be found that the accuracy is significantly lower than in the above Cifar-10 and Caltech-101 experiments. The test set accuracy of the five models lies within the range of 30%-55%, and the differences between the models are greater. With the same number of training epochs as in the Caltech-101 experiment, the test set accuracy of the five models is lower; AA-ResNet reaches about 50%. Since the number of images in the Caltech-256 dataset is three times that of Caltech-101, the training time also increases greatly; it can be seen that the training time is roughly proportional to the number of images.
In Figure 7, the loss value of the five models gradually decreases from about 5 to about 0 as the number of epochs increases. AA-ResNet first approaches 0 at 18 epochs, which indicates that AA-ResNet is ahead of the other models. The diagonal of the AA-ResNet confusion matrix is brighter, so it can be concluded that the image classification effect of this model is better.
As shown in Figure 8, the accuracy starting points of the five models are all around 0%, a phenomenon caused by the complexity of the Caltech-256 data set. Consistent with the above experiments on the Cifar-10 and Caltech-101 data sets, the fitting speed of AA-ResNet is the fastest among the five models: fitting is completed in about 18 epochs, and the training set accuracy is nearly 100%. What differs from the above experiments is that the fluctuation of the models is relatively small, while the distance between the training set accuracy (train acc) curve and the test set accuracy (test acc) curve is larger. It can be seen that Caltech-256 is extremely complex and the fitting effect of the models is poorer, and the differences between the models are greater.

In general, among the five models, the AA-ResNet proposed in this paper performs well: it has the highest test set accuracy, and its classification effect on each class is better.

Conclusions and Future Work
In the experiments, we used the Cifar-10, Caltech-101, and Caltech-256 data sets to test the superiority of the proposed AA-ResNet model over the AlexNet, VggNet, GoogLeNet, and DenseNet models, and verified that AA-ResNet has better performance in image classification. By comparing the training time, accuracy, and loss value of the five models and drawing the confusion matrices and the line graphs of accuracy and loss value, it was found that AA-ResNet performs best among the five convolutional neural network models and fits faster.
In the future, we can continue to improve the accuracy of the model by increasing the number of layers of AA-ResNet and spending more time tuning the parameters of the model. Due to hardware and software limitations, the relevant data for the complete model will be supplemented in subsequent experiments. Although the convolutional neural network model is applied to image classification and recognition in this paper, the running state of the model on a server has not been evaluated. A convolutional neural network model can be applied not only in the field of image recognition but also plays an important role in natural language processing, which requires further theoretical research and practical application.

Data Availability
The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.